Machine Learning-Assisted Computational Screening of Metal-Organic Frameworks for Atmospheric Water Harvesting

Atmospheric water harvesting by strong adsorbents is a feasible method of alleviating the shortage of water resources, especially in arid regions. In this study, machine learning (ML)-assisted high-throughput computational screening is employed to calculate the capture of H2O from N2 and O2 for 6013 computation-ready, experimental metal-organic frameworks (CoRE-MOFs) and 137,953 hypothetical MOFs (hMOFs). Through univariate analysis of MOF structure-performance relationships, Qst is shown to be a key descriptor. Moreover, three ML algorithms (random forest, gradient boosted regression trees, and neighbor component analysis (NCA)) are applied to explore the complicated interrelation between six descriptors and performance. After optimization by grid search and five-fold cross-validation, all three ML algorithms can effectively build a predictive model for CoRE-MOFs, and the accuracy R² of NCA reaches 0.97. In addition, based on the relative importance of the descriptors given by ML, it can be quantitatively concluded that Qst is dominant in governing the capture of H2O. Furthermore, the NCA model trained on the 6013 CoRE-MOFs can predict the selectivity of hMOFs with an R² of 0.86, which makes it more universal than the other models. Finally, 10 CoRE-MOFs and 10 hMOFs with high performance are identified. The computational screening and ML prediction could provide guidance and inspiration for the development of materials for atmospheric water harvesting.


Introduction
As is well known, 71% of the earth's surface is covered by water, and the remaining 29% is land. At first glance, water resources appear abundant; in fact, the water we use is mainly fresh water, which accounts for only 2.5% of all water resources on the earth [1]. Approximately 69% of the fresh water is locked in the ice sheets of Antarctica and Greenland, and a further 30% is stored in the ground, so the fresh water that humans can use directly (such as river water and freshwater lakes) accounts for only 0.4% of all water resources [1]. As the population grows and living standards improve, water resources are becoming increasingly scarce, especially daily water for residents of arid regions. Currently, one third of the world's population lives in regions with medium or high water shortages. It is estimated that two thirds of the world will face water shortages by 2050 [2]. Therefore, the lack of fresh water has become one of the major crises to be resolved. At present, several technologies are being used to address this issue. Desalination is one of the main ways to develop new fresh water resources, but the construction of this infrastructure requires a lot of money and the production process is highly energy intensive [3]. In addition, since the main arid and water-scarce areas are far inland, there […] selectivity of carbon capture. It was found that the R² values of the predicted CO2 working capacity and CO2/H2 selectivity were 0.944 and 0.872, respectively. Hypothetical MOFs (hMOFs) can be generated automatically in computer software from different metals, linkers, and topologies. Wilmer et al. [28] generated 137,953 hMOFs from a library of 102 building blocks and screened 300 hMOFs with a higher capacity for methane storage than known CoRE-MOFs. Wu et al. [29] formed a new data set with 130,397 hMOFs and 37 feature descriptors including Henry's coefficient, atomic number density, and functional group number density. 
They found that the hMOFs with optimal methane-storage capacities exhibit φ of 0.65-0.88, VSA of ~2250 m²·cm⁻³, etc.
The combination of machine learning and molecular simulation (MS) in high-throughput computational screening (HTCS) has greatly increased the speed of discovering new materials [27,30], because an ML model suited to a specific system reduces the number of materials that must be simulated, especially when a material database is updated. Recently, Pardakhti et al. [31] used a trained random forest (RF) model to predict the methane adsorption of ~130,000 hMOFs. The results showed that ML was several orders of magnitude faster than traditional MS. The combination of MS and ML has become the main current method of screening materials. In the review of Shi et al. [7], several ML methods, such as the back propagation neural network and random forest, were considered to possess better prediction performance. Therefore, in this work, we selected these methods, as well as gradient boosting regression tree and neighbor component analysis, on the basis that they have good predictive performance for water harvesting on MOFs.
In the present work, we apply MC simulation and three ML models to study the water harvesting performance of MOFs. Based on the established structure-performance relationship, all three ML models achieve relatively good predictions. We then identify the main descriptors that play an important role in the water capture performance of MOFs, and finally obtain super hydrophilic MOFs. This may provide guidance for experimentalists to synthesize suitable MOFs.

Molecular Models
The crystal structures of the 2017 version of the 6013 CoRE-MOFs were collected and established by Chung et al. [32,33] after removing the free and coordinated solvent molecules. A large crystallographic dataset of 137,953 hMOFs was designed by Wilmer et al. [28] using 102 building blocks and six different topologies. Five structural descriptors, namely the largest cavity diameter (LCD), pore-limiting diameter (PLD), volumetric surface area (VSA), void fraction (φ), and density (ρ), together with one energy descriptor, the heat of adsorption (Qst), were used to quantitatively describe each MOF. The reasons for selecting these six descriptors are as follows: (1) they exhibit strong structure-performance relationships between gas and MOFs, as confirmed by many previous high-throughput computational studies of MOFs [28,32,34,35], which means that these six descriptors are more likely to give accurate predictions in ML models than the thousands of other descriptors that could have been used; (2) these six descriptors have been applied to accurate ML prediction in many ML works [36-39]; (3) these descriptors are relatively easy to measure experimentally, and they can be used directly to guide the synthesis and application of MOFs. The LCD and PLD of each CoRE-MOF were estimated using Zeo++ [40]. The VSA and φ were determined in the RASPA package using N2 and He probes with diameters of 0.364 nm and 0.258 nm, respectively [41]. Qst was calculated by NVT-Monte Carlo (MC) simulation with the Widom method in RASPA under infinite dilution conditions, where N, V, and T are the number of particles, the volume, and the temperature of the system, respectively [41].
The partial atomic charges of the MOFs were rapidly estimated using the MEPO-Qeq method [42], which was trained to reproduce density functional theory (DFT) electrostatic potential fitted charges obtained with the Repeating Electrostatic Potential Extracted Atomic (REPEAT) method [43]. The Lennard-Jones (LJ) potential parameters of all CoRE-MOFs were taken from the universal force field (UFF) [44], as listed in Table S1. Our previous work showed that combining these force fields with the MEPO-Qeq method can accurately and quickly predict the adsorption and capture of gases in various MOFs [26,45]. The N2 and O2 molecules were described by the transferable potentials for phase equilibria (TraPPE) force field [46], as listed in Table S2. The TIP4P-Ew model [47] was used to simulate H2O molecules with LJ sites on the O and H atoms, along with the partial charges on H atoms and a dummy atom. A three-site model was applied to mimic a CO2 molecule, which has a C-O bond length of 0.116 nm and a bond angle ∠OCO of 180° [48]. Similarly, an N2 molecule was modeled as a three-site model with an N-N bond length of 0.110 nm.

Monte Carlo Simulations
To capture water from the air, the Henry's constants of H2O, N2, and O2 were calculated at 298 K using the Widom particle insertion method [49], and the selectivity S0[H2O/(N2+O2)] was then calculated from the Henry's constants of the three gas molecules. In this study, a MOF with a larger Henry's constant of H2O (KH2O) and a higher S0[H2O/(N2+O2)] is regarded as an excellent candidate. Note that the Henry's constant of water is calculated from the interaction of a single water molecule with the framework, which is intended to mimic the extremely low water content of extreme environments such as deserts (it can be regarded as only one molecule of water in the air). Although grand canonical MC (GCMC) is an accurate approach for estimating the adsorption performance of MOFs, it is difficult to accurately calculate the adsorption loading of H2O. This is because, although the structure of a water molecule is very simple, the H-O-H angle and dipole moment of the molecule can change continuously during the adsorption process, which further complicates the adsorption [50]. Thus, the H2O adsorption isotherm in most adsorbents has a jump in a narrow range of vapor pressure, which is very difficult to capture in a GCMC simulation. Currently, there is still no suitable force field or H2O model that can be used to screen the adsorption loading of H2O in most CoRE-MOFs by GCMC. In repeated GCMC tests, only a few MOFs could be predicted with relatively good agreement with the experimental isotherm [46,50-52]. Therefore, for large-scale screening of CoRE-MOFs, the Henry's constants Ki were used to calculate the adsorption selectivity of H2O in this work. For further explanation, see the supporting information (SI).
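The selectivity calculation described above can be sketched as follows. The exact expression is not given in this section, so the form used here, the ratio of the Henry's constant of H2O to the sum of those of N2 and O2, is an assumption; the function name and the numerical values are hypothetical and purely illustrative.

```python
def henry_selectivity(k_h2o: float, k_n2: float, k_o2: float) -> float:
    """Infinite-dilution selectivity of H2O over (N2 + O2).

    Assumed form: S0 = K_H2O / (K_N2 + K_O2), with all three Henry's
    constants in the same units. This is one plausible reading of the
    text, not necessarily the authors' exact definition.
    """
    if min(k_h2o, k_n2, k_o2) <= 0:
        raise ValueError("Henry's constants must be positive")
    return k_h2o / (k_n2 + k_o2)


# Hypothetical Henry's constants (arbitrary but consistent units).
s0 = henry_selectivity(k_h2o=3.0e-2, k_n2=4.0e-6, k_o2=2.0e-6)
print(f"S0 = {s0:.1f}")
```

Under this assumed definition, S0 greater than 1 means the framework prefers H2O at infinite dilution, and S0 less than 1 means it prefers N2 and O2.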
The simulation cell was extended to at least 2.4 nm along each dimension, and periodic boundary conditions were applied in all three dimensions. The framework atoms of the MOFs were assumed to be rigid and fixed during the simulations. The LJ interactions were calculated with a spherical cut-off radius of 1.2 nm and long-range corrections. The Ewald summation method [53] was used to calculate the electrostatic interactions between the frameworks and gas molecules as well as between the gas molecules. The number of MC cycles was 100,000; the first 50,000 cycles were used for equilibration, and the last 50,000 cycles were used for ensemble averages. Tests showed that the effect of further increasing the number of MC cycles on the adsorption results was negligible. All simulations were carried out with the RASPA package [41].
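The minimum cell size of 2.4 nm (twice the 1.2 nm cut-off) determines how many times a unit cell has to be replicated along each axis. A minimal sketch, assuming an orthorhombic cell so that the edge lengths equal the perpendicular widths (a triclinic cell would need the perpendicular widths instead); the function name and the example cell are hypothetical.

```python
import math


def replicas_needed(cell_lengths_nm, min_length_nm=2.4):
    """Number of unit-cell replicas per axis so that every edge of the
    simulation cell is at least min_length_nm (orthorhombic cell assumed)."""
    return tuple(max(1, math.ceil(min_length_nm / a)) for a in cell_lengths_nm)


# Hypothetical unit cell with edges of 0.9, 1.3 and 2.6 nm.
reps = replicas_needed((0.9, 1.3, 2.6))
print(reps)  # (3, 2, 1)
```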

Machine Learning Method
To find out which machine-learning (ML) model is suitable for predicting the relationship between the six descriptors (LCD, φ, VSA, PLD, ρ, Qst) and the selectivity S0[H2O/(N2+O2)] of MOFs, three ML algorithms were employed: random forest (RF), gradient boosting regression tree (GBRT), and neighbor component analysis (NCA), all run in the Statistics and Machine Learning Toolbox of MATLAB R2019a. Because the selectivity data span many orders of magnitude and cannot be predicted directly, they were pre-processed first; that is, the logarithm log10(S0[H2O/(N2+O2)]) was taken to narrow the enormous differences among the data.
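The effect of the logarithmic pre-processing can be illustrated with a few values. In this minimal sketch the function name is hypothetical and the first three selectivities are made up; the last one is the selectivity reported for QUTHAP later in the text.

```python
import math


def log_selectivity(s0_values):
    """Return log10 of each selectivity; every value must be positive."""
    return [math.log10(s) for s in s0_values]


# Selectivities spanning roughly 130 orders of magnitude.
raw = [1.0e-2, 1.0, 1.0e5, 4.14e128]
logged = log_selectivity(raw)
print([round(v, 2) for v in logged])
```

After the transform the target spans a range of about 130 units instead of 130 orders of magnitude, which is a scale a regression model can fit directly.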
In this work, the six descriptors were taken as the input variables and log10(S0[H2O/(N2+O2)]) as the output variable of ML. The key parameters of the three ML algorithms were optimized by grid search with five-fold cross-validation, which evaluated all candidate values of each parameter and programmatically selected the optimal values for the final calculation and prediction; the parameter types and ranges are listed in Table S3. All data were divided randomly into five folds. In each cycle, four folds (four-fifths of the data) served as the training set and the remaining fold (one-fifth of the data) as the test set, so the ML model was run five times for each group of key parameter values. The average coefficient of determination (R²) of the test sets in the five-fold cross-validation was adopted to indicate the performance of the model built with each parameter group.
where n, yi, fi, and f̄ refer to the number of MOFs, the simulated value, the ML-predicted value, and the average ML-predicted value, respectively.
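The exact expression for R² is not reproduced here. As a point of reference only, the conventional coefficient of determination can be computed as in the sketch below; note that this conventional form takes the total sum of squares about the mean of the simulated values, which is an assumption and may differ from the definition used in this work. The function name and data are hypothetical.

```python
def r_squared(y_sim, y_pred):
    """Conventional coefficient of determination, R2 = 1 - SS_res / SS_tot.

    SS_tot is taken about the mean of the simulated values (assumption:
    this is the textbook form, not necessarily the authors' formula).
    """
    n = len(y_sim)
    if n == 0 or n != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_y = sum(y_sim) / n
    ss_res = sum((y - f) ** 2 for y, f in zip(y_sim, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_sim)
    return 1.0 - ss_res / ss_tot


# Hypothetical simulated and ML-predicted values of log10(S0).
y_sim = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(y_sim, y_pred)
print(round(r2, 2))  # 0.98
```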
The parameter group with the maximum average R² was selected automatically as the optimal one. Apart from the optimized parameters, all other parameters were kept at their default values, as listed in Table S4. Next, the entire data set of 6013 CoRE-MOFs was used to train the model with the optimal parameter values. Finally, the trained model was tested on the data of 10,000 hMOFs.
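The optimization strategy above (grid search scored by the average test-set R² of a five-fold cross-validation) can be sketched in plain Python. This is not the MATLAB workflow used in this work: a simple k-nearest-neighbour regressor stands in for RF, GBRT, and NCA, the data are synthetic, and all function names are hypothetical.

```python
import random


def r_squared(y_true, y_pred):
    """Conventional R2 = 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def knn_predict(train_x, train_y, x, k):
    """Stand-in model: mean target of the k nearest training points."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_x, train_y)
    )
    nearest = dists[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cv_score(x, y, k, n_folds=5, seed=0):
    """Average test-set R2 over n_folds random folds for one value of k."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        preds = [knn_predict(train_x, train_y, x[i], k) for i in test_idx]
        scores.append(r_squared([y[i] for i in test_idx], preds))
    return sum(scores) / n_folds


def grid_search(x, y, k_grid):
    """Return (best_k, best_score), chosen by maximum average R2."""
    results = {k: cv_score(x, y, k) for k in k_grid}
    best_k = max(results, key=results.get)
    return best_k, results[best_k]


# Synthetic data: the target depends mostly on the first descriptor.
rng = random.Random(1)
x = [[rng.uniform(0, 1), rng.uniform(0, 1)] for _ in range(100)]
y = [10.0 * a + 0.5 * b + rng.gauss(0, 0.1) for a, b in x]

best_k, best_score = grid_search(x, y, k_grid=[1, 3, 5, 9])
print(best_k, round(best_score, 3))
```

Once the best parameter value is known, the final model would be refitted on the whole training set, which mirrors the step in which all 6013 CoRE-MOFs are used with the optimal parameters.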
Among the three algorithms, NCA [54,55] is a supervised learning algorithm that learns feature weights using a diagonal adaptation. RF consists of multiple decision trees to achieve comparatively high robustness, and its output is the average of the predictions of the individual trees [31]. Similarly, GBRT is an ensemble of decision trees, in which each new tree is fitted to the residuals left by the previous trees by minimizing the least-squares regression error. More details of the three ML algorithms are given in the SI.

Univariate Analysis
To explore the effect of the six MOF descriptors on water harvesting performance, we used univariate analysis to understand the relationship between each descriptor (LCD, φ, VSA, ρ, PLD, and Qst) and the selectivity S0[H2O/(N2+O2)]. In Figure 1, the range of S0[H2O/(N2+O2)] is very large because the adsorption behavior of water vapor differs from that of most gases: it is a typical multilayer adsorption. The adsorption of water vapor in MOFs can be divided into two stages. First, owing to the interaction between water vapor and the MOF, water molecules are gradually adsorbed on the pore walls. Second, as more water molecules enter the framework, strong hydrogen bonds form between water molecules, leading to remarkable multilayer adsorption. Thus, it is extremely important for the adsorption of water vapor that the first layer of water is successfully adsorbed in the MOF. Consequently, the difference in H2O selectivity between hydrophilic and hydrophobic MOFs is extremely large, which leads to data with very high values. In addition, the content of water vapor in the atmosphere is very small, especially in desert areas. The V-shaped adsorption isotherm and very high selectivity of water could help to achieve the capture of H2O in these extreme environments. Figure 1a shows the relationship between S0[H2O/(N2+O2)] and LCD. The selectivity is close to 0 for LCD less than 0.27 nm, which may be because H2O molecules, with a dynamic diameter of 0.264 nm, cannot enter the pores of the MOF. As the LCD continues to increase, S0[H2O/(N2+O2)] gradually decreases and eventually stabilizes at less than 1 (approximately 0.01). An S0[H2O/(N2+O2)] of less than 1 indicates that the MOF does not selectively adsorb H2O vapor, but preferentially adsorbs N2 and O2 from the atmosphere. 
This process reflects the change from shape-selective to inverse-shape-selective adsorption. The relationship between S0[H2O/(N2+O2)] and VSA is shown in Figure 1b. In the region where VSA is close to 0, the selectivity is already high. As VSA increases, the selectivity reaches its highest point and then gradually decreases. This is because, when the VSA is small, the pores of the MOF can accommodate H2O molecules, whereas when the VSA is too large, the surface accessible to all molecules in the MOF increases, so the probability that N2 and O2 reach their optimal adsorption sites increases. Therefore, the selectivity decreases; that is, the selective separation of H2O vapor cannot be achieved. The super hydrophilic MOFs with high selectivity and high Henry's constants have a VSA of less than 1000 m²·cm⁻³, except for HUZSUR01, whose VSA is 1422.66 m²·cm⁻³.
It is worth noting that, as the selectivity increases, the density and void fraction change in opposite directions. This is not difficult to understand: the larger the porosity, the larger the pore volume of the MOF and the lower its density. When ρ is less than 1260 kg·m⁻³, S0[H2O/(N2+O2)] increases as the density increases; beyond this value, the selectivity decreases with increasing ρ. In Figure 1, the void fraction φ is mapped in the subplots as color codes. The MOFs with high selectivity have a medium range of φ (0.20-0.62), except for the MOF HEWFUL (φ = 0.16). This is because pores that are too large or too small are not suitable for selective separation. If the pore is too small, H2O molecules are prevented from entering, which hinders adsorption. Conversely, if the pore is too large, the interaction between the adsorbed molecules and the MOF is weakened, which is not conducive to selective separation. The […] H2O is estimated empirically, which is usually larger than the actual size [56], so the adsorbate molecule may still be adsorbed in the MOF. Figure 2a shows that the selectivity increases monotonically with the heat of adsorption. The trend is almost linear, indicating that the isosteric heat of adsorption and the selectivity are strongly correlated. When Qst is in the range of 270-480 kJ·mol⁻¹, the MOF has its highest S0[H2O/(N2+O2)]. Since we simulated the adsorption of a single H2O molecule in the MOF at infinite dilution, the heat of adsorption can characterize the strength of the adsorption. Therefore, Qst may be a key descriptor for determining S0[H2O/(N2+O2)]. 
This phenomenon also appeared in the CO2 adsorption [57] and thiol capture [45] from the air in our previous works. Figure 2b plots the relationship between S0[H2O/(N2+O2)] and the Henry's constant of water KH2O [58]. On a logarithmic scale, the scatter plot of S0[H2O/(N2+O2)] versus KH2O shows an upward trend. The Henry's constant measures the affinity between the optimal adsorption site of the adsorbent and the adsorbate. A larger Henry's constant indicates a stronger interaction between the adsorbent and the adsorbate molecule, making adsorption-based separation achievable. Thus, a MOF with a large KH2O is required to harvest H2O vapor from the air in arid areas (RH ≈ 20%). From Figures 1 and 2, Qst seems to be the most important descriptor, and its relationship with selectivity is the most obvious. After linear, binomial, and trinomial fitting of S0[H2O/(N2+O2)] against Qst, the R² of the binomial fitting reaches 0.97 and remains stable when the trinomial fitting is used, as shown in Figure S2. The deviation of the two points with the highest S in the linear fitting gives it a lower R² than both the binomial and trinomial fittings. In fact, the R² of the linear fitting is only 0.93, and it cannot accurately predict data in the range of log S > 47. With these fittings, we can simply estimate and understand the structure-property relationships of MOFs for atmospheric water harvesting. Therefore, Qst deserves close attention during the screening process.
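The comparison of linear, binomial, and trinomial fittings can be illustrated with a small least-squares sketch. It assumes that "binomial" and "trinomial" denote second- and third-order polynomials, uses synthetic data with a deliberate curvature rather than the data of Figure S2, and solves the normal equations directly; all function names are hypothetical.

```python
def polyfit(x, y, degree):
    """Least-squares coefficients c[0] + c[1]*x + ... via normal equations."""
    m = degree + 1
    a = [[sum(xi ** (i + j) for xi in x) for j in range(m)] for i in range(m)]
    b = [sum(yi * xi ** i for xi, yi in zip(x, y)) for i in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, m):
            factor = a[row][col] / a[col][col]
            for j in range(col, m):
                a[row][j] -= factor * a[col][j]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * m
    for i in reversed(range(m)):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, m))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def polyval(coeffs, x):
    """Evaluate the polynomial with the given coefficients at x."""
    return sum(c * x ** i for i, c in enumerate(coeffs))


def r_squared(y_true, y_pred):
    """Conventional R2 = 1 - SS_res / SS_tot (assumed definition)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - f) ** 2 for y, f in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


# Synthetic descriptor/target pairs with a quadratic trend.
x = [i / 10 for i in range(21)]
y = [1.0 + 0.5 * xi + 2.0 * xi ** 2 for xi in x]

r2_by_degree = {}
for degree in (1, 2, 3):
    coeffs = polyfit(x, y, degree)
    r2_by_degree[degree] = r_squared(y, [polyval(coeffs, xi) for xi in x])
    print(degree, round(r2_by_degree[degree], 4))
```

On data with curvature, the second-order fit improves on the linear one and the third-order fit adds essentially nothing, which is the same qualitative pattern as the one reported for the fittings against Qst.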

Machine Learning
At present, ML has been widely used to predict the performance of materials. Univariate analysis reveals only the influence of a single descriptor, whereas ML can not only predict the structure-performance relationship but also capture the combined impact of multiple descriptors on performance. In our study, the optimal parameters were obtained by five-fold cross-validation and grid search. The average R² of the test sets in the five-fold cross-validation was adopted to indicate the performance of the model built with each parameter group.

The final model was trained with all 6013 data points and the optimal parameters. The results are shown in Figure 3a […]
To further understand the relative importance of the six descriptors for S0[H2O/(N2+O2)], we calculated the weight of each descriptor with the three ML algorithms. The weights were calculated while the models were being constructed. The relative importance was computed by normalizing the weights of the six descriptors, as shown in Figure 4 and Table S5. Owing to the different characteristics of the models, each ML algorithm expresses the relative importance of the descriptors in a different way. However, they all share one feature: the proportion of Qst is more than 50%, and the GBRT model in particular is built almost entirely on a single variable (Qst). The order of the six descriptors is Qst > φ > ρ > LCD ≈ VSA > PLD.
Qst seems to govern the MOF performance in this work because the concentration of water vapor in air is close to the condition of infinite dilution. The result shows that Qst is by far the most important descriptor, in agreement with Section 3.1, which provides a guide for designing the best MOFs for water vapor adsorption in experiments.
Furthermore, the predictive ML model can be used to accelerate new HTCS for other MOF databases. Of course, both the simple binomial/trinomial fitting and the ML model work for this H2O-MOF system because of the strong correlation with Qst, but the ML model should be more universal for other gas-MOF systems. Thus, we added the prediction of a new MOF database (137,953 hMOFs) [28] by the ML model trained on the 6013 CoRE-MOF data set. First, Qst was calculated for all 137,953 hMOFs, and then we selected the 10,000 hMOFs with the highest Qst for the new prediction, because Qst has the highest importance. 
As shown in Figure 5a-c, when the predicted results were compared with the results of molecular simulation, the R² of the NCA prediction reached 0.86. The performance differs between training and prediction because there are differences between the CoRE-MOF and hMOF databases. For example, there are more than 350 topologies in the CoRE-MOF database but only six in the hMOF database, which leads to a diversity gap between the databases; CoRE-MOFs also contain many more open metal sites or non-skeleton ions than hMOFs [28]. In this work, the models were established and evaluated with the 6013 CoRE-MOFs. The 10,000 hMOFs are extra data, which differ from the CoRE-MOFs in some aspects and do not participate in the establishment and evaluation of the models. The difference between NCA and GBRT/RF could be that GBRT overemphasizes the importance of Qst (relative importance ≈ 97% in Figure 4), that is, the GBRT model is built almost entirely on a single variable (Qst), and RF may fail to grasp the importance of features other than Qst. Therefore, GBRT and RF may be suitable for predicting CoRE-MOFs but not hMOFs, which also means that NCA is more universal. Nevertheless, the NCA prediction for the 10,000 hMOFs still shows sufficient predictive ability, although a model is usually not as effective on new data as on the original data set [59]. Moreover, when an hMOF possesses high selectivity (log10S > 5.3), the model performs very well. Thus, the ML model obtained from the CoRE-MOF database can pre-screen out low-performance MOFs to greatly reduce the running time of molecular simulation. Based on the ML algorithm, 80 hMOFs with high performance (log10S > 5.3) could be precisely screened out, and only these 80 hMOFs, rather than all 137,953 hMOFs, would need molecular simulation of their adsorption behavior, saving a considerable amount of time and computing resources. Finally, the optimal hMOFs are listed in Table S6.
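The pre-screening procedure described above (rank by Qst, predict log10S with the trained model, and keep only candidates above the threshold of 5.3 for molecular simulation) can be sketched as follows. The candidate records, the function names, and the linear stand-in model are hypothetical; in practice the predictor would be the ML model trained on the CoRE-MOF data.

```python
def prescreen(candidates, predict_log_s, n_top, threshold=5.3):
    """Select candidates worth simulating.

    1. Keep the n_top candidates with the highest Qst.
    2. Predict log10(S) for each of them with the supplied model.
    3. Return the names whose prediction exceeds the threshold.
    """
    ranked = sorted(candidates, key=lambda c: c["qst"], reverse=True)[:n_top]
    return [c["name"] for c in ranked if predict_log_s(c) > threshold]


def toy_model(candidate):
    """Hypothetical stand-in for a trained model: log10(S) = 0.1 * Qst."""
    return 0.1 * candidate["qst"]


# Hypothetical candidate frameworks with made-up Qst values (kJ/mol).
candidates = [
    {"name": "hMOF-a", "qst": 80.0},
    {"name": "hMOF-b", "qst": 40.0},
    {"name": "hMOF-c", "qst": 60.0},
    {"name": "hMOF-d", "qst": 20.0},
]

selected = prescreen(candidates, toy_model, n_top=3)
print(selected)  # ['hMOF-a', 'hMOF-c']
```

Only the returned names would then be passed to the expensive molecular simulation, which is where the saving in computing time comes from.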

Best CoRE-MOFs
According to the principle that excellent MOFs have both a large Henry's constant of H2O and a large selectivity, we selected 10 optimal CoRE-MOFs for harvesting water from the air, ranked by selectivity from high to low, as listed in Table 1. Among them, the best MOF is QUTHAP, whose KH2O and S0[H2O/(N2+O2)] are 2.78 × 10^124 and 4.14 × 10^128, respectively. The range of LCD, φ, VSA, PLD, ρ, and Qst of 10 MOFs is 0.035-0.988 nm, 0. 16

Conclusions
In summary, we simulated the adsorption behaviors of H2O, N2, and O2 on 6013 CoRE-MOFs and 137,953 hMOFs by HTCS and ML. After the relationships between selectivity and each of the six MOF descriptors (LCD, φ, VSA, ρ, PLD, and Qst) were analyzed, the Qst of H2O was shown to correlate strongly with the ability of a MOF to capture H2O. Furthermore, three ML algorithms were employed to predict the adsorption performance of each CoRE-MOF, indicating that NCA, with a five-fold cross-validation accuracy of R² = 0.97, is the best algorithm for predicting selectivity and that the rank of predictive ability is NCA > GBRT > RF. Moreover, the relative importance of the six descriptors given by the ML algorithms demonstrated that Qst is predominant in designing MOFs with optimal H2O/air selectivity. In addition, when the three models were applied to predict the selectivity of hMOFs, the predicted R² of NCA reached 0.86; NCA is more universal for gas-MOF systems than the other models. Finally, the ten MOFs with the best performance were screened out by statistical methods. They are potential candidates for the capture of H2O from air, especially QUTHAP. The bottom-up microscopic insights obtained from this study offer experimentalists guidelines for the development of high-performance MOFs for atmospheric water harvesting.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/nano12010159/s1, Table S1: Lennard-Jones parameters of MOFs; Figure S1: Models of N2 and O2; Table S2: Lennard-Jones parameters and charges of adsorbates; Explanation about calculation approach; Figure S2: Linear, binomial and trinomial fitting; Figures S3-S5: Details of three ML algorithms; Table S3: Type and the range of key parameters in the optimization; Table S4: Parameters of 4 ML; Table S5: Predictive importance of six descriptor by four ML algorithms; Table S6: Details of top ten hMOFs with optimal performance of water harvesting; Excel file: Top 200 CoRE-MOFs and hMOFs.
The initial selectivity of water molecules relative to nitrogen and oxygen adsorbed by MOFs.